Learning to adapt: meta-learning approaches for speaker adaptation
The performance of automatic speech recognition systems degrades rapidly when there
is a mismatch between training and testing conditions. One way to compensate for this
mismatch is to adapt an acoustic model to test conditions, for example by performing
speaker adaptation. In this thesis we focus on the discriminative model-based speaker
adaptation approach. The success of this approach relies on having a robust speaker
adaptation procedure – we need to specify which parameters should be adapted and
how they should be adapted. Unfortunately, tuning the speaker adaptation procedure
requires considerable manual effort.
In this thesis we propose to formulate speaker adaptation as a meta-learning task. In
meta-learning, learning occurs on two levels: a learner learns a task-specific model and
a meta-learner learns how to train these task-specific models. In our case, the learner is
a speaker-dependent model and the meta-learner learns to adapt a speaker-independent
model into the speaker-dependent model. By using this formulation, we can automatically learn robust speaker adaptation procedures using gradient descent. In the experiments, we demonstrate that the meta-learning approach learns competitive adaptation
schedules compared to adaptation procedures with handcrafted hyperparameters.
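To make the two-level formulation concrete, the sketch below shows how a speaker-independent model might be adapted into a speaker-dependent one with meta-learned, per-parameter step sizes. It is only a minimal PyTorch illustration: the toy model, feature dimensions, and the initialisation of `meta_lrs` are assumptions for the example, not the configuration used in the thesis.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy speaker-independent (SI) acoustic model; sizes are illustrative only.
si_model = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 100))

# Per-parameter step sizes. Instead of hand-tuning which parameters to adapt
# and by how much, these would themselves be optimised by the meta-learner;
# here they are simply initialised to a constant.
meta_lrs = {n: torch.full_like(p, 1e-3) for n, p in si_model.named_parameters()}

def adapt_to_speaker(model, meta_lrs, feats, targets, steps=3):
    """Inner loop: a few gradient steps turn the SI model into a
    speaker-dependent (SD) model, using meta-learned step sizes."""
    sd_model = copy.deepcopy(model)
    for _ in range(steps):
        loss = F.cross_entropy(sd_model(feats), targets)
        grads = torch.autograd.grad(loss, list(sd_model.parameters()))
        with torch.no_grad():
            for (name, p), g in zip(sd_model.named_parameters(), grads):
                p -= meta_lrs[name] * g  # meta-learned per-parameter step size
    return sd_model

# Adapt on a small batch of (features, targets) from one speaker.
sd_model = adapt_to_speaker(si_model, meta_lrs,
                            torch.randn(8, 40), torch.randint(0, 100, (8,)))
```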
Subsequently, we show that speaker adaptive training can be formulated as a meta-learning task as well. In contrast to the traditional approach, which maintains and optimises a copy of speaker-dependent parameters for each speaker during training, we
embed the gradient based adaptation directly into the training of the acoustic model.
We hypothesise that this formulation should steer the training of the acoustic model
into finding parameters better suited for test-time speaker adaptation. We experimentally compare our approach with test-only adaptation of a standard baseline model and
with SAT-LHUC, which represents a traditional speaker adaptive training method. We
show that the meta-learning speaker-adaptive training approach achieves comparable
results with SAT-LHUC. However, neither the meta-learning approach nor SAT-LHUC
outperforms the baseline approach after adaptation.
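Continuing the sketch above, the adaptation step can be embedded directly into training in a first-order, MAML-style outer loop: adapt to a speaker on one batch, evaluate the adapted model on held-out data from the same speaker, and update the speaker-independent parameters with the resulting gradients. This is only an illustrative approximation (first-order, one speaker per step), not the exact training procedure used in the thesis.

```python
# Meta speaker-adaptive training, first-order approximation (reuses
# si_model, meta_lrs and adapt_to_speaker from the previous sketch).
meta_opt = torch.optim.Adam(si_model.parameters(), lr=1e-4)

def meta_train_step(adapt_feats, adapt_tgts, eval_feats, eval_tgts):
    """One outer-loop step for a single speaker: inner-loop adaptation,
    evaluation on held-out data, and an update of the SI parameters so
    that the model becomes easier to adapt at test time."""
    sd_model = adapt_to_speaker(si_model, meta_lrs, adapt_feats, adapt_tgts)
    loss = F.cross_entropy(sd_model(eval_feats), eval_tgts)
    grads = torch.autograd.grad(loss, list(sd_model.parameters()))
    meta_opt.zero_grad()
    for p, g in zip(si_model.parameters(), grads):
        p.grad = g  # first-order: reuse the adapted model's gradients
    meta_opt.step()
    return loss.item()
```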
Consequently, we run a series of experimental ablations to determine why SAT-LHUC does not yield any improvements compared to the baseline approach. In these
experiments we explore multiple factors, such as different neural network architectures, normalisation techniques, activation functions, and optimisers. We find that
SAT-LHUC interferes with batch normalisation, and that it benefits from an increased
hidden layer width and an increased model size. However, the baseline model benefits from increased capacity too; therefore, to obtain the best model it is still
favourable to train a speaker-independent model with batch normalisation. As such, an
effective way of training state-of-the-art SAT-LHUC models remains an open question.
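For reference, SAT-LHUC's speaker-dependent parameters are per-unit amplitudes that re-scale hidden activations (Learning Hidden Unit Contributions). The sketch below shows the usual 2·sigmoid parameterisation of one such layer; the layer sizes and the surrounding model are assumptions made for the example.

```python
import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Hidden layer whose activations are scaled by speaker-dependent
    amplitudes r = 2 * sigmoid(theta), one theta vector per speaker."""
    def __init__(self, in_dim, out_dim, num_speakers):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # One LHUC vector per training speaker (as in SAT-LHUC); at test
        # time a fresh vector would be initialised and adapted.
        self.theta = nn.Parameter(torch.zeros(num_speakers, out_dim))

    def forward(self, x, speaker_id):
        h = torch.relu(self.linear(x))
        r = 2.0 * torch.sigmoid(self.theta[speaker_id])  # per-unit scale in (0, 2)
        return r * h

layer = LHUCLayer(40, 256, num_speakers=10)
out = layer(torch.randn(8, 40), speaker_id=3)  # theta init of zero gives r = 1
```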
Finally, we show that the performance of unsupervised speaker adaptation can be
further improved by using discriminative adaptation with lattices as supervision obtained from a first-pass decoding, instead of the traditionally used one-best path transcriptions. We find that this proposed approach enables many more parameters to
be adapted without overfitting being observed, and is successful even when the initial
transcription has a WER in excess of 50%.
Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR
We present a method for cross-lingual training of an ASR system using absolutely
no transcribed training data from the target language, and with no phonetic
knowledge of the language in question. Our approach uses a novel application of
a decipherment algorithm, which operates given only unpaired speech and text
data from the target language. We apply this decipherment to phone sequences
generated by a universal phone recogniser trained on out-of-language speech
corpora, which we follow with flat-start semi-supervised training to obtain an
acoustic model for the new language. To the best of our knowledge, this is the
first practical approach to zero-resource cross-lingual ASR which does not rely
on any hand-crafted phonetic information. We carry out experiments on read
speech from the GlobalPhone corpus, and show that it is possible to learn a
decipherment model on just 20 minutes of data from the target language. When
used to generate pseudo-labels for semi-supervised training, we obtain WERs
that range from 32.5% to just 1.9% absolute worse than the equivalent fully
supervised models trained on the same data.
Comment: Submitted to Interspeech 202
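The paper's decipherment model is not reproduced here, but the general idea of recovering a mapping between the universal recogniser's phones and the target language's phones from unpaired data can be illustrated with a much simpler algorithm: hard-EM substitution decipherment against a bigram language model over target phones estimated from the unpaired text. The sketch below assumes a one-to-one channel with no insertions or deletions, which the actual method does not.

```python
import numpy as np

def viterbi(obs, lm_trans, emis, lm_init):
    """Most likely target-phone sequence for one observed phone sequence,
    assuming a substitution-only (1:1) channel model."""
    T, S = len(obs), lm_trans.shape[0]
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = np.log(lm_init) + np.log(emis[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(lm_trans)  # prev state x next state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(emis[:, obs[t]])
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def decipher(obs_seqs, lm_trans, lm_init, n_src, n_tgt, iters=10):
    """Hard-EM decipherment: decode with the current substitution table,
    then re-estimate the table from the decoded target-phone sequences.
    lm_trans / lm_init form a bigram LM over target phones, estimated
    from unpaired target-language text."""
    emis = np.full((n_tgt, n_src), 1.0 / n_src)  # flat substitution table
    for _ in range(iters):
        counts = np.full((n_tgt, n_src), 0.1)    # light smoothing
        for obs in obs_seqs:
            for tgt, src in zip(viterbi(obs, lm_trans, emis, lm_init), obs):
                counts[tgt, src] += 1
        emis = counts / counts.sum(axis=1, keepdims=True)
    return emis
```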
Speaker adaptive training using model agnostic meta-learning
Speaker adaptive training (SAT) of neural network acoustic models learns
models in a way that makes them more suitable for adaptation to test
conditions. Conventionally, model-based speaker adaptive training is performed
by having a set of speaker-dependent parameters that are jointly optimised with
speaker-independent parameters in order to remove speaker variation. However,
this does not scale well if all neural network weights are to be adapted to the
speaker. In this paper we formulate speaker adaptive training as a
meta-learning task, in which an adaptation process using gradient descent is
encoded directly into the training of the model. We compare our approach with
test-only adaptation of a standard baseline model and a SAT-LHUC model with a
learned speaker adaptation schedule and demonstrate that the meta-learning
approach achieves comparable results.
Comment: Accepted to IEEE ASRU 201
Acoustic Word Embeddings for Untranscribed Target Languages with Continued Pretraining and Learned Pooling
Acoustic word embeddings are typically created by training a pooling function
using pairs of word-like units. For unsupervised systems, these are mined using
k-nearest neighbor (KNN) search, which is slow. Recently, mean-pooled
representations from a pre-trained self-supervised English model were suggested
as a promising alternative, but their performance on target languages was not
fully competitive. Here, we explore improvements to both approaches: we use
continued pre-training to adapt the self-supervised model to the target
language, and we use a multilingual phone recognizer (MPR) to mine phone n-gram
pairs for training the pooling function. Evaluating on four languages, we show
that both methods outperform a recent approach on word discrimination.
Moreover, the MPR method is orders of magnitude faster than KNN, and is highly
data efficient. We also show a small improvement from performing learned
pooling on top of the continued pre-trained representations.
Comment: Accepted to Interspeech 202
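The mean-pooled self-supervised baseline can be sketched as below, using a publicly available wav2vec 2.0 checkpoint purely as a stand-in; the paper's actual model, its continued pre-training on the target language, and the learned pooling function are not reproduced here.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Checkpoint chosen only for illustration, not the model used in the paper.
name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

def acoustic_word_embedding(word_segment_16khz):
    """Mean-pool frame-level self-supervised features over one word segment
    to obtain a fixed-size acoustic word embedding."""
    inputs = extractor(word_segment_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        frames = model(**inputs).last_hidden_state  # (1, num_frames, hidden_dim)
    return frames.mean(dim=1).squeeze(0)            # (hidden_dim,)

# e.g. a 0.5 s word-like segment of 16 kHz audio
emb = acoustic_word_embedding(torch.randn(8000).numpy())
```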
ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition
In Speech Emotion Recognition (SER), textual data is often used alongside
audio signals to address their inherent variability. However, the reliance on
human-annotated text in most research hinders the development of practical SER
systems. To overcome this challenge, we investigate how Automatic Speech
Recognition (ASR) performs on emotional speech by analyzing the ASR performance
on emotion corpora and examining the distribution of word errors and confidence
scores in ASR transcripts to gain insight into how emotion affects ASR. We
utilize four ASR systems, namely Kaldi ASR, wav2vec2, Conformer, and Whisper,
and three corpora: IEMOCAP, MOSI, and MELD to ensure generalizability.
Additionally, we conduct text-based SER on ASR transcripts with increasing word
error rates to investigate how ASR affects SER. The objective of this study is
to uncover the relationship and mutual impact of ASR and SER, in order to
facilitate ASR adaptation to emotional speech and the use of SER in the real world.
Comment: Accepted to INTERSPEECH 202
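As a small illustration of the word-level analysis, per-utterance and corpus-level WER over ASR transcripts can be computed as below; the reference/hypothesis pairs are toy stand-ins, and corpus handling, confidence scores, and the SER models are omitted.

```python
import jiwer

# Toy reference transcripts and ASR hypotheses standing in for an emotion corpus.
refs = ["i am so happy today", "please leave me alone"]
hyps = ["i am so happy to day", "please leave me a lone"]

for ref, hyp in zip(refs, hyps):
    print(f"WER = {jiwer.wer(ref, hyp):.2f} | REF: {ref} | HYP: {hyp}")

# Corpus-level WER across all utterances
print("corpus WER:", jiwer.wer(refs, hyps))
```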